Predictive and Distributed Routing Balancing for HPC Clusters
Authors
Abstract
Current parallel applications require an interconnection network that provides low and bounded communication delays. Communication characteristics such as traffic pattern and communication load change over time and may eventually exceed network capacity, causing congestion and performance degradation. Congestion control based on adaptive routing should therefore be applied to adapt quickly to changing traffic conditions. Studies of parallel applications show that their behavior is repetitive and can be characterized by a set of representative phases. This work presents a Predictive and Distributed Routing Balancing technique (PR-DRB) that controls network congestion through adaptive traffic distribution. PR-DRB uses speculative routing based on application repetitiveness: it monitors message latencies at routers and logs solutions to congestion so that it can respond quickly when similar situations recur. Experimental results show that the predictive approach could be used to improve performance.
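The idea of logging solutions to congestion and reusing them can be illustrated with a toy sketch. This is not the PR-DRB implementation: the class `PredictiveBalancer`, the latency threshold, the use of the set of contending flows as a pattern signature, and the `search_paths` callback are all hypothetical names and simplifications introduced here for illustration only.

```python
class PredictiveBalancer:
    """Toy model: remembers which alternative paths were chosen for a congestion pattern."""

    def __init__(self, latency_threshold):
        # Hypothetical threshold above which a latency sample counts as congestion.
        self.latency_threshold = latency_threshold
        # Log of solutions: congestion signature -> list of alternative paths.
        self.solutions = {}

    @staticmethod
    def signature(flows):
        # Assumed pattern key: the set of (source, destination) flows in contention.
        return frozenset(flows)

    def on_latency(self, flows, latency, search_paths):
        """Return (paths, reused): the paths to use and whether they came from the log."""
        if latency <= self.latency_threshold:
            return [], False  # no congestion detected, keep the current route
        key = self.signature(flows)
        if key in self.solutions:
            # Repeated pattern: apply the logged solution without searching again.
            return self.solutions[key], True
        # New pattern: run the (possibly slow) path search and log its result.
        paths = search_paths(flows)
        self.solutions[key] = paths
        return paths, False


# Usage: the second occurrence of the same pattern reuses the logged paths.
calls = []

def slow_search(flows):
    calls.append(flows)
    return [("A", "X", "B"), ("A", "Y", "B")]

balancer = PredictiveBalancer(latency_threshold=10.0)
flows = [("A", "B"), ("C", "B")]
first = balancer.on_latency(flows, 25.0, slow_search)
second = balancer.on_latency(flows, 30.0, slow_search)
```

In this sketch the path search runs only once for a given pattern, which is the sense in which a predictive scheme can respond faster to recurring application phases. A real scheme would also have to validate that a logged solution still relieves congestion before trusting it.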
Similar resources
Randomized Load-balanced Routing for Fat-tree Networks
Fat-tree networks have been widely adopted in High Performance Computing (HPC) clusters and Data Center Networks (DCN). These parallel systems usually have a large number of servers and hosts, which generate large volumes of highly volatile traffic. Thus, the design of distributed load-balancing routing becomes critical to achieving high bandwidth utilization and low-latency packet delivery. Existin...
Designing an MPI-based Parallel and Distributed Machine Learning Platform on Large-scale HPC Clusters
This paper presents the design of an MPI-based parallel and distributed machine learning platform on large-scale HPC clusters. Researchers and practitioners can easily implement a class of parallelizable machine learning algorithms on the platform, or quickly port an existing non-parallel implementation of a parallelizable algorithm to the platform with only minor modifications. Complicated fun...
Towards Dynamic Load Balancing Using Page Migration and Loop Re-partitioning on Omni/SCASH
Large-scale clusters of SMPs are increasingly becoming the majority platform in the HPC field. In such a cluster environment, load imbalances may arise for several reasons, and misplacement of data can create performance bottlenecks. To overcome these problems, dynamic load balancing mechanisms are needed. In this paper, we report our ongoing work on a dynamic load balancing extension to O...
MLCA: A Multi-Level Clustering Algorithm for Routing in Wireless Sensor Networks
Energy constraint is the biggest challenge in wireless sensor networks because each sensor node is powered by a battery that, given the applications of these networks, cannot be recharged or replaced. One of the successful methods for saving energy in these networks is clustering. As a result, cluster-based routing algorithms have been successful routing algorithms for these networks....
Dynamic Routing Balancing On InfiniBand Networks
InfiniBand (IBA) technology was developed to address the performance issues associated with message movement among Endnodes and computer I/O devices. However, InfiniBand is also widely deployed within high performance computing (HPC) clusters due to the high bandwidth and low message latency it offers to inter-processor communication systems. An interconnection-network efficient des...